Search Results: "Erich Schubert"

22 December 2014

Erich Schubert: Java sum-of-array comparisons

This is a follow-up to a post by Daniel Lemire on a closely related topic.
Daniel Lemire has experimented with boxing a primitive array in an interface, and has been trying to measure the cost.
I must admit I was a bit sceptical about his results, because I have seen Java successfully inlining code in various situations.
For an experimental library I occasionally work on, I had been spending quite a bit of time on benchmarking. Previously, I had used Google Caliper for it (I even wrote an evaluation tool for it to produce better statistics). However, Caliper hasn't seen many updates recently, and there is a very attractive similar tool at OpenJDK now, too: the Java Microbenchmark Harness (JMH), which can actually be used for benchmarking at other scales, too.
Now that I have experience in both, I must say I consider JMH superior, and I have switched over my microbenchmarks to it. One of the nice things is that it doesn't make this distinction of micro vs. macrobenchmarks, and the runtime of your benchmarks is easier to control.
I largely recreated his task using JMH. The benchmark task is simple: compute the sum of an array; the question is what the cost is of allowing data structures other than double[].
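Here is a minimal sketch of what such a JMH benchmark can look like (class, method and parameter names are mine, not the exact code behind the numbers below):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SumBenchmark {
  @Param({ "1000000" })
  int size;

  double[] data;

  @Setup
  public void setup() {
    data = new double[size];
    for (int i = 0; i < size; i++) {
      data[i] = i;
    }
  }

  @Benchmark
  public double forSum() {
    // Plain for loop over the primitive array.
    double sum = 0;
    for (int i = 0; i < data.length; i++) {
      sum += data[i];
    }
    return sum;
  }

  @Benchmark
  public double foreachSum() {
    // Foreach loop; the adapter and boxed variants differ only in how the
    // double[] is reached (behind an interface or a wrapper object).
    double sum = 0;
    for (double v : data) {
      sum += v;
    }
    return sum;
  }
}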
My results, however, are quite different. The statistics from JMH indicate that the differences may not be significant, suggesting that Java manages to inline the code properly.
Benchmark        (size)   Mode  Cnt    Score    Error  Units
adapterFor       1000000  thrpt  50  836,898   13,223  ops/s
adapterForL      1000000  thrpt  50  842,464   11,008  ops/s
adapterForR      1000000  thrpt  50  810,343    9,961  ops/s
adapterWhile     1000000  thrpt  50  839,369   11,705  ops/s
adapterWhileL    1000000  thrpt  50  842,531    9,276  ops/s
boxedFor         1000000  thrpt  50  848,081    7,562  ops/s
boxedForL        1000000  thrpt  50  840,156   12,985  ops/s
boxedForR        1000000  thrpt  50  817,666    9,706  ops/s
boxedWhile       1000000  thrpt  50  845,379   12,761  ops/s
boxedWhileL      1000000  thrpt  50  851,212    7,645  ops/s
forSum           1000000  thrpt  50  845,140   12,500  ops/s
forSumL          1000000  thrpt  50  847,134    9,479  ops/s
forSumL2         1000000  thrpt  50  846,306   13,654  ops/s
forSumR          1000000  thrpt  50  831,139   13,519  ops/s
foreachSum       1000000  thrpt  50  843,023   13,397  ops/s
whileSum         1000000  thrpt  50  848,666   10,723  ops/s
whileSumL        1000000  thrpt  50  847,756   11,191  ops/s
The postfix is the iteration type: sum using for loops, with a local variable for the length (L), or in reverse order (R); and while loops (again with a local variable for the length). The prefix is the data layout: the primitive array, the array behind a static adapter (the approach I have been using in many implementations in cervidae), and a "boxed" wrapper class around the array (roughly the approach that Daniel Lemire has been investigating). On the primitive array, I also included the foreach loop approach (for(double v : array)).
If you look at the standard deviations, the results are pretty much identical, except for the reverse loops. This is not surprising, given the strong inlining capabilities of Java - all of these variants will compile to nearly the same machine code after warmup and HotSpot optimization.
I do not have a full explanation for the differences others have been seeing. There is no "polymorphism" occurring here at runtime - there is only a single Array implementation in use; but the same was true of his benchmark.
Here is a visualization of the results (sorted by average):
Result boxplots
As you can see, most results are indiscernible. The measurement standard deviation is higher than the individual differences. If you run the same benchmark again, you will likely get a different ranking.
Note that performance may - drastically - drop once you use multiple adapters or boxing classes in the same hot codepath. Java Hotspot keeps statistics on the classes it sees, and as long as it only sees 1-2 different types, it performs quite aggressive optimizations instead of doing "virtual" method calls.
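To make that last point concrete, here is a sketch (interface and class names are mine, not taken from the benchmark or from cervidae) of the kind of call site this is about:

interface ArrayLike {
  int size();
  double get(int i);
}

final class DoubleArrayAdapter implements ArrayLike {
  private final double[] data;
  DoubleArrayAdapter(double[] data) { this.data = data; }
  public int size() { return data.length; }
  public double get(int i) { return data[i]; }
}

final class FloatArrayAdapter implements ArrayLike {
  private final float[] data;
  FloatArrayAdapter(float[] data) { this.data = data; }
  public int size() { return data.length; }
  public double get(int i) { return data[i]; }
}

class SumDemo {
  // Fast while this call site only ever sees one or two adapter classes;
  // once a third class shows up here, the JIT typically falls back to a
  // real virtual dispatch, and the accessor no longer inlines as nicely.
  static double sum(ArrayLike a) {
    double s = 0;
    for (int i = 0, n = a.size(); i < n; i++) {
      s += a.get(i);
    }
    return s;
  }
}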

25 November 2014

Erich Schubert: Installing Debian with sysvinit

First let me note that I am using systemd, so these things here are untested by me. See e.g. Petter's and Simon's blog entries on the same overall topic.
According to the Debian installer maintainers, the only accepted way to install Debian with sysvinit is to use preseeding. This can either be done at the installer boot prompt by manually typing the magic spell:
preseed/late_command="in-target apt-get install -y sysvinit-core"
or by using a preseeding file (which is a really nice feature I used for installing my Hadoop nodes) to do the same:
d-i preseed/late_command string in-target apt-get install -y sysvinit-core
If you are a sysadmin, using preseeding can save you a lot of typing. Put all your desired configuration into preseeding files, put them on a webserver (best with a short name resolvable by local DNS). Let's assume you have set up the DNS name d-i.example.com, and your DHCP is configured such that example.com is on the DNS search list. You can also add a vendor extension to DHCP to serve a full URL. Manually enabling preseeding then means adding
auto url=d-i
to the installer boot command line (d-i is the hostname I suggested setting up in your DNS before, and the full URL would then be http://d-i.example.com/d-i/jessie/./preseed.cfg). Preseeding is well documented in Appendix B of the installer manual, but it will nevertheless require a number of iterations to get everything to work as desired for a fully automatic install like the one I used for my Hadoop nodes.
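For reference, a minimal (untested) preseed fragment along these lines could look like this - the first three lines use keys from the example preseed file in the installer manual, the last one is the sysvinit spell from above:
d-i debian-installer/locale string en_US.UTF-8
d-i keyboard-configuration/xkb-keymap select us
d-i pkgsel/include string openssh-server
d-i preseed/late_command string in-target apt-get install -y sysvinit-core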

There might be an easier option.
I have filed a wishlist bug suggesting the use of the tasksel mechanism to allow the user to choose sysvinit at installation time. However, it got turned down by the Debian installer maintainers quite rudely with a "No." - essentially a "shut the f... up and go away", which is in my opinion an inappropriate way to discard a reasonable user wishlist request.
Since I don't intend to use sysvinit anymore, I will not be pursuing this option further. It is, as far as I can tell, still untested. If it works, it might be the least-effort, least-invasive option to allow the installation of a sysvinit Jessie (apart from the command line magic above).
If you are interested in sysvinit, you should now test whether this approach works (I don't use sysvinit, so I won't):
  1. Get the patch proposed to add a task-sysvinit package.
  2. Build an installer CD with this tasksel (maybe this documentation is helpful for this step).
  3. Test whether the patch works. Report results to above bug report, so that others interested in sysvinit can find them easily.
  4. Find and fix bugs if it didn't work. Repeat.
  5. Publish the modified ("forked") installer, and get user feedback.
If you are then still up for a fight, you can try to convince the maintainers (or go the nasty way, and ask the CTTE for their opinion, to start another flamewar and make more maintainers give up) that this option should be added to the mainline installer. And hurry up, or you may at best get this into Jessie reloaded, 8.1 - chances are that the release manager will not accept such patches this late anymore. The sysvinit supporters should have investigated this option much, much earlier instead of losing time on the GR.
Again, I won't be doing this job for you. I'm happy with systemd. But patches and proofs of concept are what make open source work, not GRs and MikeeUSA's crap videos spammed to the LKML...
(And yes, I am quite annoyed by the way the Debian installer maintainers handled the bug report. This is not how open-source collaboration is supposed to work. I tried to file a proper wishlist bug report, suggesting a solution that I could not find discussed anywhere before, and got back just this "No. Shut up." answer. I'm not sure I will ever report a bug against debian-installer again, if this is the way they handle bug reports ...)
I do care about our users, though. If you look at popcon "vote" results, we have 4179 votes for sysvinit-core and 16918 votes for systemd-sysv (graph) indicating that of those already testing jessie and beyond - neglecting 65 upstart votes, and assuming that there is no bias to not-upgrade if you prefer sysvinit - about 20% appear to prefer sysvinit (in fact, they may even have manually switched back to sysvinit after being upgraded to systemd unintentionally?). These are users that we should listen to, and that we should consider adding an installer option for, too.

19 November 2014

Erich Schubert: What the GR outcome means for the users

The GR outcome is: no GR necessary
This is good news.
Because it says: Debian will remain Debian, as it was the last 20 years.
For 20 years, we have tried hard to build the "universal operating system" and give users a choice. We've often had alternative software in the archive. Debian has come up with various tools to manage alternatives over time, and for example allows you to switch the system-wide Java.
You can still run Debian with sysvinit. There are plenty of Debian Developers who will fight for this to remain possible in the future.
The outcome of this resolution says:
  • Using a GR to force others is the wrong approach to getting compatibility.
  • We've offered choice before, and we trust our fellow developers to continue to work towards choice.
  • Write patches, not useless GRs. We're coders, not bureaucrats.
  • We believe we can do this, without making it a formal MUST requirement. Or even a SHOULD requirement. Just do it.
The sysvinit proponents may perceive this decision as having "lost". But they just don't realize that they won, too, because the GR may easily have backfired on them. The GR did not just say "every package must support sysvinit"; it also said "every sysvinit package must support systemd". Here is an example: eudev, a non-systemd fork of udev. It is not yet in Debian, but I'm fairly confident that someone will package it after the release, for the next Debian. Given the text of the GR, this package might have been inappropriate for Debian unless it also supported systemd. But systemd has its own udev - there is no reason to force eudev to work with systemd, is there?
Debian is about choice. This includes the choice to support different init systems as appropriate. Not accepting a proper patch that adds support for a different init would be perceived as a major bug, I'm assured.
A GR doesn't ensure choice. It only is a hammer to annoy others. But it doesn't write the necessary code to actually ensure compatibility.
If GNOME at some point decides that systemd as pid 1 is a must, the GR only would have left us three options: A) fork the previous version, B) remove GNOME altogether, C) remove all other init systems (so that GNOME is compliant). Does this add choice? No.
Now, we can preserve choice: if GNOME decides to go systemd-pid1-only, we can both include a forked GNOME, and the new GNOME (depending on systemd, which is allowed without the GR). Or any other solution that someone codes and packages...
Don't fear that systemd will magically become a must. Trust that the Debian Developers will continue what they have been doing the last 20 years. Trust that there are enough Debian Developers that don't run systemd. Because they do exist, and they'll file bugs where appropriate. Bugs and patches, that are the appropriate tools, not GRs (or trolling).

18 November 2014

Erich Schubert: Generate iptables rules via pyroman

Vincent Bernat blogged about using Netfilter rulesets, pointing out that inserting the rules one by one using iptables calls may leave your firewall temporarily incomplete, possibly half-working, and that this approach can be slow.
He's right with that, but there are tools that do this properly. ;-)
Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.
But it also addresses the points Vincent raised:
  • It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
  • It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)
It also has a nice feature for use in staging: it can generate firewall rule sets offline, to allow you to review them before use, or to transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in Python conditionals will still be the name of the host you generate the rules on - you may want to add a --hostname parameter to pyroman).
pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:
# /etc/pyroman/02_icmpv6.py:82
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP
indicates that this particular line was produced due to line 82 in file /etc/pyroman/02_icmpv6.py. This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.
For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.
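Until then, a rough (untested) workaround is to split the combined output at that marker line yourself and validate each half with the --test option before loading it:
pyroman --print-verbose > /tmp/all.rules
awk '/^#### IPv6 rules/{v6=1; next} !v6' /tmp/all.rules | iptables-restore --test
awk '/^#### IPv6 rules/{v6=1; next} v6' /tmp/all.rules | ip6tables-restore --test
Drop --test to actually load the rules; the comment lines annotating the rule origins are ignored by iptables-restore.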
Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.

9 November 2014

Erich Schubert: GR vote on init coupling

Overregulation is bad, and the project is suffering from the recent Anti-Systemd hate campaigning.
There is nothing balanced about the original GR proposal. It is bullshit from a policy point of view (it means we must, by policy, remove software from Debian that would not work with other inits, such as gnome-journal). At the same time, it uses manipulative language like "freedom to select a different init system" (as if this would otherwise be impossible) and "accidentally locked in". It is exactly this type of language and behavior which has made Debian quite poisonous over the last months.
In fact, the GR pretty much says "I don't trust my fellow maintainers to do the right thing, therefore I want a new hammer to force my opinion on them". This is unacceptable in my opinion, and the GR will only demotivate contributors. Every Debian developer (I'm not talking about systemd upstream, but about Debian developers!) I've met would accept a patch that adds support for sysvinit to a package that currently doesn't. The proposed GR will not improve sysvinit support. It is a hammer to kick out software where upstream doesn't want to support sysvinit, but it won't magically add sysvinit support anywhere.
What some supporters of the GR may not have realized is that it may just as well backfire on them: some packages that don't yet work with systemd would then violate policy, too... In my opinion, it is much better to provide support on an "as good as possible, given available upstream support and patches" basis, instead of a "must" basis. The lock-in may come even faster if we make init system support mandatory: it may be more viable to drop software to satisfy the GR than to add support for other inits - and since systemd is the current default, packages that don't support systemd are good candidates to be dropped, aren't they? (Note that I do prefer to keep them, and to have a policy that allows keeping them ...)
For these reasons I voted:
  1. Choice 4: GR not required
  2. Choice 3: Let maintainers do their work
  3. Choice 2: Recommended, but not mandatory
  4. Choice 5: No decision
  5. Choice 1: Ban packages that don't work with every init system
The fact is that Debian maintainers have always tried hard to allow people to choose their favorite software. Until you give me an example where a Debian maintainer (not upstream) has refused to include sysvinit support, I will continue to trust my fellow DDs. I had been considering placing Choice 2 below "further discussion", but essentially this is a no-op GR anyway - in my opinion, "should support other init systems" is already present in Debian policy anyway...
Say no to the haters.
And no, I'm not being unfair. One of the most verbose haters going by various pseudonyms such as Gregory Smith (on Linux Kernel Mailing list), Brad Townshend (LKML) and John Garret (LKML) has come forward with his original alias - it is indeed MikeeUSA, a notorious anti-feminist troll (see his various youtube "songs", some of them include this pseudonym). It's easy to verify yourself.
He has not contributed anything to the open source community. His songs and "games" are not worth looking at, and I'm not aware of any project that has accepted any of his "contributions". Yet, he uses several sock puppets to spread his hate.
The anti-systemd "crowd" (if it actually is more than a few notorious trolls) has lost all its credibility in my opinion. They spread false information, use false names, and focus on hate instead of improving source code. And worse, they tolerate such trolling in their ranks.

23 October 2014

Erich Schubert: Clustering 23 mio Tweet locations

To test the scalability of ELKI, I've clustered 23 million Tweet locations from the Twitter Statuses Sample API, obtained over 8.5 months (due to licensing restrictions by Twitter, I cannot make this data available to you, sorry).
23 million points is a challenge for advanced algorithms. It's quite feasible with k-means, in particular if you choose a small k and limit the number of iterations. But k-means does not make a whole lot of sense on this data set - it is a forced quantization algorithm and does not discover actual hotspots.
Density-based clustering algorithms such as DBSCAN and OPTICS are much more appropriate. DBSCAN is a bit tricky to parameterize - you need to find the right combination of radius and density for the whole world. Given that Twitter adoption and usage vary quite a bit, it is very likely that you won't find a single parameter setting that is appropriate everywhere.
OPTICS is much nicer here. We only need to specify a minimum object count - I chose 1000, as this is a fairly large data set. For performance reasons (and this is where ELKI really shines) I chose a bulk-loaded R*-tree index for acceleration. To benefit from the index, the epsilon radius of OPTICS was set to 5000m. Also, ELKI allows using geodetic distance, so I can specify this value in meters and do not get many artifacts from coordinate projection.
To extract clusters from OPTICS, I used the Xi method, with xi set to 0.01 - a rather low value, also owing to the large data set.
The results are pretty neat - here is a screenshot (using KDE Marble and OpenStreetMap data, since Google Earth segfaults for me right now):
Screenshot of Clusters in central Europe
Some observations: unsurprisingly, many cities turn up as clusters. Regional differences are also apparent, as seen in the screenshot: plenty of Twitter clusters in England, and a low acceptance rate in Germany (Germans do seem to have objections to using Twitter; maybe they still prefer texting, which was quite big in Germany - France and Spain use Twitter a lot more than Germany does).
Spam - some of the high usage in Turkey and Indonesia may be due to spammers using a lot of bots there. There also is a spam cluster in the ocean south of Lagos - some spammer uses random coordinates [0;1]; there are 36000 tweets there, so this is a valid cluster...
A benefit of OPTICS and DBSCAN is that they do not cluster every object - low-density areas are considered noise. Also, they support clusters of different shapes (which may be lost in this visualization, which uses convex hulls!) and different sizes. OPTICS can also produce a hierarchical result.
Note that for these experiments, the actual Tweet text was not used. This has a rough correspondence to Twitter popularity "heatmaps", except that the clustering algorithms will actually provide a formalized data representation of activity hotspots, not only a visualization.
You can also explore the clustering result in your browser - the Google Drive visualization functionality seems to work much better than Google Earth.
If you go to Istanbul or Los Angeles, you will see some artifacts - odd shaped clusters with a clearly visible spike. This is caused by the Xi extraction of clusters, which is far from perfect. At the end of a valley in the OPTICS plot, it is hard to decide whether a point should be included or not. These errors are usually the last element in such a valley, and should be removed via postprocessing. But our OpticsXi implementation is meant to be as close as possible to the published method, so we do not intend to "fix" this.
Certain areas - such as Washington, DC, New York City, and Silicon Valley - do not show up as clusters. The reason is probably again the Xi extraction - these regions do not exhibit the steep density increase expected by Xi, but are too blurred into their surroundings to form a cluster.
Hierarchical results can be found e.g. in Brasilia and Los Angeles.
Compare the OPTICS results above to k-means results (below) - see why I consider k-means results to be a meaningless quantization?
k-means clusters
Sure, k-means is fast (30 iterations, not yet converged: 138 minutes on a single core with k=1000; the parallel k-means implementation in ELKI took 38 minutes on a single node; Hadoop/Mahout on 8 nodes took 131 minutes, about as slow as a single CPU core!). But you can see how sensitive it is to misplaced coordinates (outliers, but mostly spam), how many "clusters" end up somewhere in the ocean, and that there is no resolution within the cities. The UK is covered by 4 clusters with little meaning, and three of these clusters stretch all the way into Bretagne - the k-means clusters clearly aren't of high quality here.
If you want to reproduce these results, you need to get the upcoming ELKI version (0.6.5~201410xx - the output of cluster convex hulls was just recently added to the default codebase), and of course data. The settings I used are:
-dbc.in coords.tsv.gz
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-pagefile.pagesize 500
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-time
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.01
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 1000
-resulthandler KMLOutputHandler -out /tmp/out.kmz
and the total runtime for 23 million points on a single core was about 29 hours. The indexes helped a lot: less than 10000 distances were computed per point, instead of 23 million - the expected speedup over a non-indexed approach is 2400.
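For completeness, I pass these options to the ELKI command line application roughly like this (class name as in the 0.6.x series; increase the heap limit as needed for your data):
java -Xmx8g -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication \
  [the options listed above]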
Don't try this with R or Matlab. Your average R clustering algorithm will try to build a full distance matrix, and you probably don't have the petabytes of memory needed to store it (23 million squared entries at 8 bytes each is roughly 4 petabytes). Maybe start with a smaller data set first, then see how far you can afford to increase the data size.

21 October 2014

Erich Schubert: Avoiding systemd isn't hard

Don't listen to trolls. They lie.
Debian was and continues to be about choice. Previously, you could configure Debian to use other init systems, and you can continue to do so in the future.
In fact, with wheezy, sysvinit was essential. In the words of trolls, Debian "forced" you to install SysV init!
With jessie, it will become easier to choose the init system, because no init system is essential now. Instead, there is an essential meta-package "init", which requires you to install one of systemd-sysv, sysvinit-core, or upstart. In other words, you have more choice than ever before.
Again: don't listen to trolls.
However, notice that there are some programs, such as login managers (e.g. gdm3), which have an upstream dependency on systemd. gdm3 links against libsystemd0 and depends on libpam-systemd; and the latter depends on systemd-sysv or systemd-shim, so it is in fact software such as GNOME that is pulling systemd onto your computer.
IMHO you should give systemd a try. There are some broken (SysV-) init scripts that cause problems with systemd; but many of these cases have now been fixed - not in systemd, but in the broken init script.
However, here is a clean way to prevent systemd from being installed when you upgrade to jessie. (No need to "fork" Debian for this, which just demonstrates how uninformed some trolls are ... - apart from Debian being very open to custom debian distributions, which can easily be made without "forking".)
As you should know, apt allows version pinning. This is the proper way to prevent a package from being installed. All you need to do is create a file named e.g. /etc/apt/preferences.d/no-systemd with the contents:
Package: systemd-sysv
Pin: release o=Debian
Pin-Priority: -1
According to the documentation, a priority below 0 prevents the package from being installed. systemd-sysv is the package that would enable systemd as your default init (/sbin/init).
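You can verify that the pin is picked up with apt-cache policy; if it is, systemd-sysv should no longer be offered as an installation candidate:
apt-cache policy systemd-sysv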
This change will make it much harder for aptitude to solve dependencies. A good way to help it to solve the dependencies is to install the systemd-shim package explicitly first:
aptitude install systemd-shim
After this, I could upgrade a Debian system from wheezy to jessie without being "forced" to use systemd...
In fact, I could also do an aptitude remove systemd systemd-shim. But that would have required the uninstallation of GNOME, gdm3 and network-manager - you may or may not be willing to do this. On a server, there shouldn't be any component actually depending on systemd at all. systemd is mostly a GNOME-desktop thing as of now.
As you can see, the trolls are totally blaming the wrong people, for the wrong reasons... and in fact, the trolls make up false claims (for the record, systemd-shim was updated on Oct 14). Stop listening to trolls, please.
If you find a bug - a package that needlessly depends on systemd, or a good way to remove some dependency e.g. via dynamic linking, please contribute a patch upstream and file a bug. Solve problems at the package/bug level, instead of wasting time doing hate speeches.

18 October 2014

Erich Schubert: Beware of trolls - do not feed

A particularly annoying troll has been on his hate crusade against systemd for months now.
Unfortunately, he's particularly active on Debian mailing lists (but apparently also on Ubuntu and the Linux kernel mailing lists) and uses tons of fake accounts he keeps setting up. Our listmasters have a hard time blocking all his hate, sorry.
Obviously, this is also the same troll that has been attacking Lennart Poettering.
There is evidence that this troll used to go by the name "MikeeUSA", and has quite a reputation with anti-feminist hate for over 10 years now.
Please, do not feed this troll.
Here are some names he uses on YouTube: Gregory Smith, Matthew Bradshaw, Steve Stone.
Blacklisting is the best measure we have, unfortunately.
Even if you don't like the road systemd is taking, or Lennart Poettering personally - the behaviour of that troll is unacceptable to say the least, and indicates some major psychological problems... also, I wouldn't be surprised if he were involved in #GamerGate, too.
See this example (LKML) if you have any doubts. We seriously must not tolerate such poisonous people.
If you don't like systemd, the acceptable way of fighting it is to write good alternative software (and you should be able to continue using SysV init or OpenRC in Debian, unless there is a bug - in which case, provide a bug fix). End of story.

17 October 2014

Erich Schubert: Google Earth on Linux

Google Earth for Linux appears to be largely abandoned by Google, unfortunately. The packages available for download cannot be installed on a modern amd64 Debian or Ubuntu system due to dependency issues.
In fact, the amd64 version is a 32-bit build, too. The packages are of really low quality: the dependencies are outdated, locale support is busted, etc.
So here are hacky instructions on how to install it nevertheless. But beware, these instructions are a really bad hack.
  1. These instructions are appropriate for version 7.1.2.2041-r0. Do not use them for any other version. Things will have changed.
  2. Make sure your system has the i386 architecture enabled. Follow the instructions in section "Configuring architectures" on the Debian MultiArch Wiki page to do so (a condensed command sketch follows after this list).
  3. Install lsb-core, and try to install the i386 versions of these packages, too!
  4. Download the i386 version of the Google Earth package
  5. Install the package by forcing dependencies, via
    sudo dpkg --force-depends -i google-earth-stable_current_i386.deb
    
  6. As of now, your package manager will complain, and suggest to remove the package again. To make it happy, we have to hack the installed packages list. This is ugly, and you should make a backup. You can totally bust your system this way... Fortunately, the change we're doing is rather simple. As admin, edit the file /var/lib/dpkg/status. Locate the section Package: google-earth-stable. In this section, delete the line starting with Depends:. Don't add in extra newlines or change anything else!
  7. Now the package manager should believe the dependencies of Google Earth are fulfilled, and no longer suggest removal. But essentially this means you have to take care of them yourself!
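Condensed into commands (untested beyond the version above; the multiarch part follows the wiki page from step 2), steps 2 to 5 roughly amount to:
sudo dpkg --add-architecture i386
sudo apt-get update
sudo apt-get install lsb-core
sudo dpkg --force-depends -i google-earth-stable_current_i386.deb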
Some notes on using Google Earth:

25 August 2014

Erich Schubert: Analyzing Twitter - beware of spam

This year I started to widen my research, and one data source of interest was text, because its lack of structure often makes it challenging. One of the data sources that everybody seems to use is Twitter: it has a nice API, and few restrictions on using it (except on resharing the data). By default, you can get a 1% random sample of all tweets, which is more than enough for many use cases.
We've had some exciting results which a colleague of mine will be presenting tomorrow (Tuesday, Research 22: Topic Modeling) at the KDD 2014 conference:
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
You can also explore some (static!) results online at signi-trend.appspot.com

In our experiments, the "news" data set was more interesting. But after some work, we were able to get reasonable results out of Twitter as well. As you can see from the online demo, most of these fall into pop culture: celebrity deaths, sports, hip-hop. Not much that would change our lives; and even less that wasn't already captured by traditional media.
The focus of this post is on the preprocessing needed for getting good results from Twitter. Because it is much easier to get bad results!
The first thing you need to realize about Twitter is that due to the media attention/hype it gets, it is full of spam. I'm pretty sure the engineers at Twitter already try to reduce spam; block hosts and fraud apps. But a lot of the trending topics we discovered were nothing but spam.
Retweets - the "like" of Twitter - are an easy way to see what is popular, but they are not very interesting if you want to analyze text. They just reiterate the exact same text (except for an "RT " prefix) as earlier tweets. We found the results to be more interesting when we removed retweets. Our theory is that retweeting requires much less effort than writing a real tweet, and things that are trending "with effort" are more interesting than those that were just liked.
Teenie spam. If you ever searched for a teenie idol on Twitter, say this guy I hadn't heard of before, but who has 3.88 million followers on Twitter, and search for tweets addressed to him, you will get millions over millions of results. Many of these tweets look like this: Now if you look at this tweet, there is this odd "x804" at the end. This is there to defeat a simple spam filter by Twitter, because this user did not tweet this just once: instead, it is common amongst teenies to spam their idols with follow requests by the dozen, probably using some JavaScript hack or a third-party Twitter client. Occasionally, you see hundreds of such tweets, each sent within a few seconds of the previous one. Even if you only get a 1% sample of these, you still get quite a few...
Even worse (for data analysis) than teenie spammers are commercial spammers and wannabe "hackers" who exercise their "sk1llz" by spamming Twitter. To get a sample of such spam, just search for weight loss on Twitter. There is plenty of fresh spam there, usually consisting of some text pretending to be news and an anonymized link (there is no need to use a URL shortener such as bit.ly on Twitter, since Twitter has its own URL shortener t.co; you'll just end up with double-shortened URLs). And the hacker spam is even worse (e.g. #alvianbencifa), as he seems to have trojaned hundreds of users, and his advertisement is a nonexistent hashtag, which he tries to get into Twitter's "trending topics".
And then there are the bots. Plenty of bots spam Twitter with their analysis of trending topics, reinforcing the trending topics. In my opinion, bots such as trending topics indonesia are useless. No wonder there are only 280 followers. And of the trending topics reported, most of them seem to be spam topics...

Bottom line: if you plan on analyzing Twitter data, spend considerable time on preprocessing to filter out spam of various kinds. For example, we remove singletons and digits, then feed the data through a duplicate detector. We end up discarding 20%-25% of tweets. But we still get some of the spam, such as that hacker's spam.
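As a toy illustration of the kind of filtering meant here (a sketch only, not our actual pipeline):

import java.util.*;

class TweetFilter {
  // Drop retweets and strip digits (such as the "x804" counters) before a
  // trivial exact-duplicate check; a real pipeline would use near-duplicate
  // detection instead of exact matching.
  static List<String> preprocess(List<String> tweets) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String t : tweets) {
      if (t.startsWith("RT ")) {
        continue;
      }
      String key = t.replaceAll("[0-9]+", "");
      if (seen.add(key)) {
        out.add(t);
      }
    }
    return out;
  }
}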
All in all, real data is just messy. People posting there have an agenda that might be opposite to yours. And if someone (or some company) promises you wonders from "Big data" and "Twitter", you better have them demonstrate their use first, before buying their services. Don't trust visions of what could be possible, because the first rule of data analysis is: Garbage in, garbage out.

22 April 2014

Erich Schubert: Kernel-density based outlier detection and the need for customization

Outlier detection (also: anomaly detection, change detection) is an unsupervised data mining task that tries to identify the unexpected.
Most outlier detection methods are based on some notion of density: in an appropriate data representation, "normal" data is expected to cluster, and outliers are expected to be further away from the normal data.
This intuition can be quantified in different ways. Common heuristics include kNN outlier detection and the Local Outlier Factor (which uses a density quotient). One of the directions in my dissertation was to understand (also from a statistical point of view) how the output and the formal structure of these methods can be best understood.
I will present two smaller results of this analysis at the SIAM Data Mining 2014 conference: instead of the very heuristic density estimation found in the above methods, we design a method (using the same generalized pattern) that uses a best practice from statistics: kernel density estimation. We aren't the first to attempt this (cf. LDF), but we actually retain the properties of the kernel, whereas the authors of LDF tried to mimic the LOF method too closely, and this way damaged the kernel.
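For readers who have not used it: a plain one-dimensional Gaussian kernel density estimate with a fixed bandwidth looks like the sketch below (a generic textbook version, not the generalized method from the paper):

// Naive O(n) Gaussian KDE at position x with bandwidth h.
static double kde(double[] sample, double x, double h) {
  double sum = 0;
  for (double xi : sample) {
    double u = (x - xi) / h;
    sum += Math.exp(-0.5 * u * u);
  }
  // Normalize by n * h * sqrt(2 * pi).
  return sum / (sample.length * h * Math.sqrt(2 * Math.PI));
}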
The other result presented in this work is the need to customize. When working with real data, using "library algorithms" will more often than not fail. The reason is that real data isn't as nicely behaved - it's dirty, and it is seldom normally distributed. And the problem that we're trying to solve is often much narrower. For best results, we need to integrate our preexisting knowledge of the data into the algorithm. Sometimes we can do so by preprocessing and feature transformation. But sometimes, we can also customize the algorithm easily.
Outlier detection algorithms aren't black magic, or carefully adjusted. They follow a rather simple logic, and this means that we can easily take only parts of these methods, and adjust them as necessary for our problem at hand!
The article presented at SDM demonstrates such a use case: when analyzing 1.2 million traffic accidents in the UK (from data.gov.uk), we are not interested in "classic" density-based outliers - that would be a rare traffic accident on a small road somewhere in Scotland. Instead, we're interested in unusual concentrations of traffic accidents, i.e. blackspots.
The generalized pattern can be easily customized for this task. While this data does not allow automatic evaluation, many outliers could be easily verified using Google Earth and search: often, historic imagery on Google Earth showed that the road layout was changed, or that there are many news reports about the dangerous road. The data can also be nicely visualized, and I'd like to share these examples with you. First, here is a screenshot from Google Earth for one of the hotspots (Cherry Lane Roundabout, North of Heathrow airport, which used to be a double cut-through roundabout - one of the cut-throughs was removed since):
Screenshot of Cherry Lane Roundabout hotspot
Google Earth is best for exploring this result, because you can hide and show the density overlay to see the crossroad below; and you can go back in time to access historic imagery. Unfortunately, KML does not allow easy interactions (at least it didn't last time I checked).
I have also put the KML file on Google Drive. It will automatically display it on Google Maps (nice feature of Drive, kudos to Google!), but it should also allow you to download it. I've also explored the data on an Android tablet (but I don't think you can hide elements there, or access historic imagery as in the desktop application).
With a classic outlier detection method, this analysis would not have been possible. However, it was easy to customize the method, and the results are actually more meaningful: instead of relying on some heuristic to choose the kernel bandwidth, I opted to choose the bandwidth by physical arguments: 50 meters is a reasonable bandwidth for a crossroad / roundabout, and for comparison a radius of 2 kilometers is used to model the typical accident density in the region (in Europe, there should be other crossroads within 2 km).
Since I advocate reproducible science, the source code of the basic method will be in the next ELKI release. For the customization case studies, I plan to share them as a how-to or tutorial type of document in the ELKI wiki; probably also detailing data preprocessing and visualization aspects. The code for the customizations is not really suited for direct inclusion in the ELKI framework, but can serve as an example for advanced usage.
Reference:
E. Schubert, A. Zimek, H.-P. Kriegel
Generalized Outlier Detection with Flexible Kernel Density Estimates
In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014.
So TLDR of the story: A) try to use more established statistics (such as KDE), and B) don't expect an off-the-shelf solution to do magic, but customize the method for your problem.
P.S. if you happen to know nice post-doc positions in academia:
I'm actively looking for a position to continue my research. I'm working on scaling these methods to larger data and to make them work with various real data that I can find. Open-source, modular and efficient implementations are very important to me, and one of the directions I'd like to investigate is porting these methods to a distributed setting, for example using Spark. In order to get closer to "real" data, I've started to make these approaches work e.g. on textual data, mixed type data, multimedia etc. And of course, I like teaching; which is why I would prefer a position in academia.

13 February 2014

Erich Schubert: Google Summer of Code 2014 - not participating

Google Summer of Code 2014 is currently open for mentoring organizations to register. I've decided to not apply for GSoC with my data mining open source project ELKI anymore.

11 February 2014

Erich Schubert: Debian chooses systemd as default init

It has been all over the place: the Debian CTTE has chosen systemd over upstart as the default init system, by chairman call. This decision was overdue, as it was pretty clear that the tie would not change, and thus it would be up to the chairman. There were no new positions presented, and nobody was being convinced of a different preference. The whole discussion had been escalating, and had started to harm Debian.
Some people may not want to hear this, but another 10 ballots and options would not have changed this outcome. Repeating essentially the same decision (systemd, upstart, or "other things nobody prefers") would do no good, but turn out the same result, a tie. Every vote count I saw happen would turn out this tie: half of the CTTE members prefer systemd, the other half upstart. The main problems, however, are these:
So please, everybody get back to work now. If there ever is enough reason to overturn this decision, there are formal ways to do this, and there are people in the CTTE (half of the CTTE, actually) who will take care of this. Until then, live with the result of a 0.1 vote lead for systemd. Instead of pursuing a destructive hate campaign, why not improve your favorite init system instead. Oh, and don't forget about the need to spell out a policy for init system support requirements of packages.

9 February 2014

Erich Schubert: Minimizing usage of debian-multimedia packages

How to reduce your usage of debian-multimedia packages: as you might be aware, the "deb-multimedia" packages have seen their share of chaos. Originally, they were named "debian-multimedia", but it was then decided that they should better be called "deb-multimedia", as they are not an official part of Debian. The old domain was then grabbed by cybersquatters when it expired.
While a number of packages remain undistributable for Debian due to legal issues - such as decrypting DVDs - and thus "DMO" remains useful to many desktop users, please note that for many of the packages, a reasonable version exists within Debian "main".
So here is a way to prevent automatic upgrades to DMO versions of packages that also exist in Debian main: we will use a configuration option known as apt pinning. We will modify the priority of DMO packages to below 100, which means they will not be automatically upgraded to, but they can still be installed when e.g. no version of the package exists within Debian main. In other words, the packages can still be easily installed, but apt will prefer the official versions. For this, we need to create a file I named /etc/apt/preferences.d/unprefer-dmo with the contents:
Package: *
Pin: release o=Unofficial Multimedia Packages
Pin-Priority: 123
As long as the DMO archive doesn't rename itself, this pin will work; it will continue to work if you use a mirror of the archive, as it is not tied to the URL. It will not downgrade packages for you, though. If you want to downgrade, that is a trickier process. You can use aptitude for this, and start with a filter (l key, as in "limit") of ?narrow(~i,~Vdmo). This will only show installed packages whose version number contains dmo. You can now patrol this list, enter the detail view of each package, and check the version list at the end for a non-dmo version. I cannot claim this will be an easy process; you'll probably have a lot of conflicts to clear up, due to different libavcodec* API versions. If I recall correctly, I opted to uninstall some dmo packages such as ffmpeg that I do not really need. In other cases, the naming is slightly different: Debian main has handbrake, while dmo has handbrake-gtk. A simple approach is to consider uninstalling all of these packages, then reinstalling as needed; since installing Debian packages is trivial, it does not hurt to deinstall something, does it? When reinstalling, Debian packages will be preferred over DMO packages.
I prefer to use DFSG-free software wherever possible. And as long as I can watch my favorite series (Tatort, started in 1970 and airing episode 900 this month) and the occasional DVD movie, I have all I need.
An even more powerful aptitude filter to review installed non-Debian software is ?narrow(~i,!~ODebian), or equivalently ?narrow(?installed,?not(?origin(Debian))). This will list all package versions that cannot be installed from Debian sources. In particular, this includes versions that are no longer available (they may have been removed because of security issues!), software that has been installed manually from .deb files, or any other non-Debian source such as Google or DMO. (You should check the output of aptitude policy to make sure that no source claims to be o=Debian that isn't actually Debian, though.) This filter is a good health check for your system.
Debian packages receive a good amount of attention with respect to packaging quality, maintainability, security and all of these aspects. Third-party packages usually fail the Debian quality check in at least one respect. For example, Sage isn't in Debian yet, mostly because it has so many dependencies that making a reliable and upgradeable installation is a pain. Similarly, there is no Hadoop in Debian yet. If you look at Hadoop packaging efforts such as Apache Bigtop, they do not live up to Debian quality yet. In particular, the packages have high redundancy, and re-include various .jar copies instead of sharing access to an independently packaged dependency. If a security issue arises with any of these jars, all the Hadoop packages will need to be rebuilt.
As you can see, when software is not packaged, it is usually not because Debian is too strict or too conservative about free software. More often than not, it's because the software in question currently does not live up to Debian packaging quality requirements. And of course sometimes it is because there is too little community backing the software in question that would improve the packaging.
If you want your software to be included in Debian, try to make it packaging friendly. Don't force-include copies of other software, for example. Allow the use of system-installed libraries where possible, and provide upgrade paths and backwards compatibility. Make dependencies optional, so that your software can be packaged even if a dependency is not yet available - and does not have to be removed if the dependency becomes a security risk. Maybe learn how to package yourself, too. This will make it easier for you to push new features and bug fixes to your userbase: instead of requiring your users to upgrade manually, make your software packaging friendly, and your users will be auto-upgraded by their operating system to the latest version.
Update: If you are using APT::Default-Release or other pins, these may override the above pin. You may need to use apt-cache policy to find out whether the pins are in effect, and rewrite your default-release pin into e.g.
Package: *
Pin: release a=testing,o=Debian
Pin-Priority: 912
for debugging, use a different priority for each pin, so you can see which one is in effect. Notice the o=Debian in this pin, which makes it apply only to Debian repositories.

24 January 2014

Erich Schubert: Definition of Data Science

Everything is "big data" now, and everything is "data science". Because these terms lack a proper, falsifiable definition.
A number of attempts to define them exist, but they usually only consist of a number of "nice to haves" strung together. For Big Data, it's the 3+ V's, and for Data Science, this diagram on Wikipedia is a typical example.
This is not surprising: effectively, these terms are all marketing, not scientific attempts at defining a research domain.
Actually, my favorite definition is this, except that it should maybe read pink pony in the middle, instead of unicorn.
Data science has been called "the sexiest job" so often, this has recently led to an integer overflow.
The problem with these definitions is that they are open-ended. They name some examples (like "volume"), but they essentially leave it open to call anything you like "big data" or "data science". This is, of course, a marketer's dream buzzword. There is nothing saying that "picking my nose" is not big data science.

If we ever want to get to a usable definition and get rid of all the hype, we should consider a more precise definition; even when this means making it more exclusive (funnily enough, some people already called above open-ended definitions "elitist" ...).
Big data:
  • Must involve distributed computation on multiple servers
  • Must intermix computation and data management
  • Must advance over the state-of-the-art of relational databases, data warehousing and cloud computing in 2005
  • Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
  • Must be disruptively more data-driven
Data science:
  • Must incorporate domain knowledge (e.g. business, geology, etc.).
  • Must take computational aspects into account (scalability etc.).
  • Must involve scientific techniques such as hypothesis testing and result validation.
  • Results must be falsifiable.
  • Should involve more mathematics and statistics than earlier approaches.
  • Should involve more data management than earlier approaches (indexing, sketching&hashing etc.).
  • Should involve machine learning, AI or knowledge discovery algorithms.
  • Should involve visualization and rapid prototyping for software development.
  • Must satisfy at least one of these "shoulds" at a disruptive level.
But this is all far from a proper definition. Partially because these fields are so much in flux; but largely because they're just too ill-defined.
There is a lot of overlap, that we should try to flesh out. For example, data science is not just statistics. Because it is much more concerned with how data is organized and how the computations can be made efficiently. Yet often, statistics is much better at integrating domain knowledge. People coming from computation, on the other hand, usually care too little about the domain knowledge and falsifiability of their results - they're happy if they can compute anything.
Last but not least, nobody will be in favor of such a rigid definition and requirements. Because most likely, you will have to strip that "data scientist" label off your business card - and why bite the hand that feeds? Most of what I do certainly would not qualify as data science or big data anymore with an "elitist" definition. While this doesn't lessen my scientific results, it makes them less marketable.
Essentially, this is like a global "gentlemen's agreement": buzz these words while they last, then move on to the next similar "trending topic".

Maybe we should just leave these terms to the marketing folks, and let them bubble them till it bursts. Instead, we should just stick to the established and better defined terms...
Thank you.
Of course, sometimes you will have to play buzzword bingo. Nobody is going to stop you. But I will understand it as you merely "playing buzzword bingo", unless you get more precise.
Once you have results that are so massively better that they really disrupted science, you can still call it "data science" later on.
As you may have noticed, I've been picking on the word "disruptive" a lot. As long as you are doing "business as usual" and focusing on off-the-shelf solutions, it will not be disruptive. And then it won't be big data science, or a big data approach that yields major gains. It will be just "business as usual" with different labels, and it will return results as usual.
Let's face it. We don't just want big data or data science. What everybody is looking for is disruptive results, which will require a radical approach, not a slight modification involving slightly more computers of what you have been doing all along.

22 January 2014

Erich Schubert: The init wars

The init wars have recently caught a lot of media attention (e.g. heise, prolinux, phoronix). However, one detail is often overlooked: Debian is debating over the default, while all of the candidates are actually already supported to a large extent. Most likely, at least two of them will be made mandatory to support, IMHO.
The discussion seems to be quite heated, with lots of people trying to evangelize for their preferred system. This actually only highlights that we need to support more than one, as Debian has always been about choice. This may mean some extra work for the debian-installer developers, because choosing the init system at install time (instead of switching later) will be much easier. More often than not, when switching from one init system to another you will have to perform a hard reset.
If you want to learn about the options, please go to the formal discussion page, which does a good job at presenting the positions in a neutral way.
Here is my subjective view of the init systems:
  • SysV init is the current default, and thus deserves to be mentioned first. It is slow, because it is based on a huge series of shell scripts. It can often be fragile, but at the same time it is very transparent. For a UNIX system administrator, SysV init is probably the preferred choice. You only reboot your servers every year anyway.
  • upstart seems to be a typical Canonical project. It solves a great deal of problems, but apparently isn't good enough at it for everybody, and they fail at including anyone in their efforts. Other examples of these fails include Unity and Mir, where they also announced the project as-is, instead of trying to get other supporters on board early (AFAICT). The key problem to widespread upstart acceptance seems to be the Canonical Contributor License Agreement that many would-be contributors are unwilling to accept. The only alternative would be to fork upstart completely, to make it independent of Canonical. (Note that upstart nevertheless is GPL, which is why it can be used by Debian just fine. The CLA only makes getting patches and enhancements included in the official version hard.)
  • systemd is the rising star in the init world. It probably has the best set of features, and it has started to incorporate/replace a number of existing projects such as ConsoleKit, i.e. it not only manages services, but also user sessions. It is loosely tied to the GNOME project, which has started to rely on it more and more (much to the unhappiness of Canonical, who used to be a key player for GNOME; note that officially, GNOME chose not to depend on systemd, yet I see this as the only reliable combination to get a complete GNOME system running, and since "systemd can eventually replace gnome-session" I foresee this tie becoming closer). As the main drawback, systemd as is will (apparently) only work with the Linux kernel, whereas Debian also has to support kFreeBSD, NetBSD, Hurd and the OpenSolaris kernels (some aren't officially supported by Debian, but by separate projects).
So my take: I believe the only reasonable default is systemd. It has the most active development community and widest set of features. But as it cannot support all architectures, we need mandatory support for an alternative init system, probably SysV. Getting both working reliably will be a pain, in particular since more and more projects (e.g. GNOME) tie themselves closely to systemd, and would then become Linux-only or require major patches.
I have tried only systemd on a number of machines, and unfortunately I cannot report it as "prime time ready" yet. You do have the occasional upgrade problems and incompatibilities, as it is quite invasive. From screensavers activating during movies to double suspends, to being unable to shutdown my system when logged in (systemd would treat the login manager as separate session, and not being the sole user it would not allow me to shut down), I have seen quite a lot of annoyances happen. This is an obvious consequence of the active development on systemd. This means that we should make the decision early, because we will need a lot of time to resolve all these bugs for the release.
There are more disruptions coming on the way. Nobody seems to have talked about kDBUS yet, the integration of an IPC mechanism like DBUS into the Linux kernel. It IMHO has a good chance of making it into the Linux kernel rather soon, and I wouldn't be surprised if it became mandatory for systemd soon after. Which then implies that only a recent kernel (say, mid-2014) version might be fully supported by systemd soon.
I would also like to see less GNOME influence in systemd. I have pretty much given up on the GNOME community, which is moving into a UI direction that I hate: they seem to only care about tablet and mobile phones for dumb users, and slowly turn GNOME into an android UI; selling black background as major UI improvements. I feel that the key GNOME development community does not care about developers and desktop users like me anymore (but dream of being the next Android), and therefore I have abandoned GNOME and switched to XFCE.
I don't give upstart much of a chance. Of course there are some Debian developers already involved in its development (employed by Canonical), so this will cause some frustration. But so far, upstart is largely an Ubuntu-only solution. And just like Mir, I don't see much future in it; instead I foresee Ubuntu switching to systemd within a few years, because it will want all the latest GNOME features. Ubuntu relies on GNOME, and apparently GNOME has already chosen systemd over upstart (even though this is "officially" denied).
Sticking with SysV is obviously the easiest choice, but it does not make a lot of sense to me technically. It's okay for servers, but more and more desktop applications will start to rely on systemd. For legacy reasons, I would however like to retain good SysV support for at least 5-10 more years.

But what is the real problem? After all, this is a long overdue decision.
  • There is too much advocacy and evangelism, from either side. The CTTE isn't really left alone to make a technical decision; instead the main factors have unfortunately become political. All kinds of companies (such as Spotify) weigh in on the debate, too.
  • The tone has become quite aggressive and emotional, unfortunately. I can already foresee some comments on this blog post "you are a liar, because GNOME is now spelled Gnome!!1!".
  • Media attention. This upcoming decision has been picked up by various Linux media already, increasing the pressure on everybody.
  • Last but not least, the impact will be major. Debian is one of the largest distributions, used not least by Ubuntu and Steam, amongst others. Debian preferring one over the other will unfortunately be a slap in somebody's face.
So how to solve it? Let the CTTE do their discussions, and stop flooding them with mails trying to influence them. There has been so much lobbying going on that it may even backfire. I'm confident they will find a reasonable decision, or they'll decide to poll all the DDs. If you want to influence the outcome, provide patches to anything that doesn't yet fully support your init system of choice! I'm sure there are hundreds of packages which have neither upstart nor systemd support yet (as is, I currently have 27 init.d scripts launched by systemd, for example). IMHO, nothing is more convincing than having things just work - and, of course, contributing code. We are in open source development, and the one thing that earns you sympathy in the community is contributing code to someone else's project. For example, contribute fully integrated power-management support to XFCE, if your init system of choice includes power-management functionality.
As is, I apparently have 7 packages installed with upstart support, and 25 with systemd support. So either everybody is crazy about systemd, or systemd simply has the better record of getting its scripts accepted upstream. (Note that this straw poll is biased - with systemd, the benefit of not using "legacy" init.d scripts may just be larger.)

16 December 2013

Erich Schubert: Java Hotspot Compiler - a heavily underappreciated technology

When I had my first contact with Java, probably around Java 1.1 or 1.2, it all felt clumsy and slow. And this is still the reputation Java has with many developers: bloated source code and slow performance.
Over the last years I've worked a lot with Java; it would not have been my personal first choice, but as it is usually the language the students know best, it was the best choice for this project, the data mining framework ELKI.
I've learned a lot about Java since, including debugging and optimizing Java code. ELKI contains a number of tasks that require good number-crunching performance; something where Java in particular had the reputation of being slow.
I must say, this is not entirely fair. Sure, the pure matrix multiplication performance of Java is not up to Fortran (BLAS libraries are usually implemented in Fortran, and many tools such as R or NumPy will use them for the heavy lifting). But there are other tasks than matrix multiplication, too!
There are a number of things where Java could still be improved a lot. Some of this will come with Java 8; other things are still missing. I'd particularly like to see native BLAS support and multi-valued on-stack returns (to allow an intrinsic sincos, for example).

In this post, I want to emphasize that usually the Hotspot compiler does an excellent job.
A few years ago, I would have laughed at those who claimed "Java code can even be faster than C code" - after all, the JVM itself is written in C. Having had a deeper look at what the Hotspot compiler does, I am no longer surprised that quite often, reasonably good Java code outperforms reasonably good C code.
In fact, I'd love to see a "hotspot" optimizer for C.
So what is it that makes Hotspot so fast? In my opinion, the key ingredient to Hotspot performance is aggressive inlining. And this is exactly why "reasonably well written" Java code can be faster than C code written at a similar level.
Let me explain this with an example. Assume we want to compute a pairwise distance matrix, but we want the code to support arbitrary distance functions. The code will roughly look like this (not heavily optimized):
for (int i = 0; i < size; i++)
  for (int j = i + 1; j < size; j++)
    matrix[i][j] = computeDistance(data[i], data[j]);
In C, if you want to be able to choose computeDistance at runtime, you would likely make it a function pointer, or in C++ use e.g. boost::function or a virtual method. In Java, you would use an interface method instead, i.e. distanceFunction.distance().
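To make this concrete, here is a minimal sketch of such an interface-based design; the names (DistanceFunction, EuclideanDistance, PairwiseDistances) are illustrative placeholders, not the actual ELKI API:

interface DistanceFunction {
  double distance(double[] a, double[] b);
}

class EuclideanDistance implements DistanceFunction {
  @Override
  public double distance(double[] a, double[] b) {
    double sum = 0.;
    for (int d = 0; d < a.length; d++) {
      final double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}

class PairwiseDistances {
  // The call below is an interface invocation; after warmup, Hotspot will
  // usually inline the concrete distance() implementation right here.
  static void compute(double[][] data, double[][] matrix, DistanceFunction df) {
    for (int i = 0; i < data.length; i++) {
      for (int j = i + 1; j < data.length; j++) {
        matrix[i][j] = df.distance(data[i], data[j]);
      }
    }
  }
}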
In C, your compiler will most likely emit an indirect call (e.g. call *%eax) to invoke the distance function; with virtual methods in C++, it will load the target method from the vtable and then perform the same kind of register-indirect branch. Java will, however, try to inline this code at runtime, i.e. it will often insert the actual distance function used at the location of this call.
Why does this make a difference? CPUs have become quite good at speculative execution, prefetching and caching. Yet, as far as I can tell, it can still pay off to save those indirect branches, even if only to let the CPU apply these techniques to predict other branches better. But there is also a second effect: the Hotspot compiler will optimize the inlined version of the code, whereas the C compiler has to optimize the two functions independently (as it cannot know they will only be used in this combination).
Hotspot can be quite aggressive there. It will even inline and optimize when it is not 100% sure that its assumptions are correct. It just adds simple guards (e.g. some type checks), and when these assumptions fail, it jumps back to the interpreter and re-optimizes.
You can see the inlining effect in Java when you use the -Xcomp flag, which tells the Java VM to compile everything at load time. It cannot do as much speculative inlining then, as it does not yet know which methods will be called and which classes will be seen. Instead, it has to compile the code using virtual method invocations, just like C++ would for this kind of dispatch - except that in Java, every non-final instance method is virtual (in C++, you have to declare this explicitly). You will likely see a substantial performance drop when using this flag; it is not recommended. Instead, let Hotspot perform its inlining magic. It will inline a lot by default - in particular tiny methods such as getters and setters.
I'd love to see something similar in C or C++. There are some optimizations that can only be done at runtime, not at compile time. Maybe not even at linking time; but only with runtime type information, and that may also change over time (e.g. the user first computes a distance matrix for Euclidean distance, then for Manhattan distance).

Don't get me wrong. I'm not saying Java is perfect. There are a lot of common mistakes, such as using the java.util collections (e.g. ArrayList<Double>) for primitive types, which comes at a massive memory and garbage collection overhead. The first thing to check when optimizing a Java application is memory usage overhead. But all in all, good Java code can indeed perform well, and may even outperform C code, due to the inlining optimization I just discussed; in particular on large projects, where you can no longer fine-tune inlining in C by hand.

Sometimes, Hotspot may also fail. Which is largely why I've been investigating these issues recently. In ELKI 0.6.0 I'm facing a severe performance regression with linear scans (which is actually the simpler codepath: no indexes, just a simple loop as seen above). I had a similar issue with 0.5.0 before, but back then I was able to revert to an earlier version that still performed well (even though the code was much more redundant). This time, I would have had to revert a larger refactoring that I wanted to keep, unfortunately.
Because the regression was quite severe - from 600 seconds to 1500-2500 seconds (still clearly better than -Xcomp) - I first assumed I was facing an actual programming bug. Careful inspection down to the assembler code produced by the hotspot VM did not reveal any such error. Then I tried Java 8, and the regression was gone.
So apparently it is not a programming error; Java 7 just fails to optimize it remotely as well as it did the previous ELKI version!
If you are a Java guru interested in tracking down this regression, feel free to contact me. It's in an open source project, ELKI. I'd be happy to have good performance for linear scans on Java 7, too, but I don't want to waste any more hours on this; instead I plan to move on to Java 8 anyway for other reasons (lambda expressions, which will greatly reduce the amount of glue code needed). Plus, Java 8 is faster in my benchmarks.

2 November 2013

Erich Schubert: Numerical precision and "Big Data"

Everybody is trying (or pretending) to be "big data" these days. Yet, I have the impression that there are very few true success stories. People fight with the technology to scale, and have not even arrived at the point of doing detailed analysis - or even meta-analysis, i.e. checking whether the new solutions actually perform better than the old "small data" approaches.
In fact, a lot of the "big data" approaches just reason "you can't do it with the existing solutions, so we cannot compare". Which is not exactly true.

My experience with large data (not big; it still fits into main memory) is that you have to be quite careful with your methods. Just scaling up existing methods does not always yield the expected results. The larger your data set, the more tiny problems surface that can ruin your computations.
"Big data" is often based on the assumption that just by throwing more data at your problem, your results will automatically become more precise. This is not true. On contrary: the larger your data, the more likely you have some contamination that can ruin everything.
We tend to assume that numerics and such issues have long been resolved. But while there are some solutions, it should be noted that they come at a price: they are slower, have some limitations, and are harder to implement.
Unfortunately, such issues lurk just about everywhere in data analysis. I'll demonstrate it with a very basic example. Assume we want to compute the mean of the following series: [1e20, 3, -1e20]. Computing a mean - everybody should be able to do this, right? Well, let's agree that the true solution is 1, as the first and last term cancel out. Now let's try some variants:
  • Python, naive: sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy sum: numpy.sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy mean: numpy.mean([1e20, 3, -1e20]) yields 0.0
  • Python, less-known function: math.fsum([1e20, 3, -1e20])/3 yields 1.0
  • Java, naive: System.out.println( (1e20+3-1e20)/3 ); yields 0.0
  • R, mean: mean( c(1e20,3,-1e20) ) yields 0
  • R, sum: sum( c(1e20,3,-1e20) )/3 yields 0
  • Octave, mean: mean([1e20,3,-1e20]) yields 0
  • Octave, sum: sum([1e20,3,-1e20])/3 yields 0
So what is happening here? All of these functions (except Python's less-known math.fsum) use plain double precision. In double precision, 1e20 + 3 = 1e20, as a double can only retain 15-16 significant digits. To actually get the correct result, you need to keep track of your accumulated error using additional doubles.
Now you may argue that this only happens with large differences in magnitude. Unfortunately, this is not true. It also surfaces when you have a large number of observations! Again, I'm using Python to exemplify (because math.fsum is accurate):
> import numpy, math
> a = numpy.array(range(-1000000, 1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)
As the math.fsum result shows, accurate solutions do exist - for example Shewchuk's algorithm (which is probably what powers math.fsum). In many cases, Kahan summation will also be sufficient; it essentially gives you twice the working precision of doubles.
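For illustration, here is a minimal Kahan summation sketch in Java (the class and method names are just for this example): a second double carries the rounding error of previous additions, which roughly doubles the effective precision. It typically handles cases like the long, nearly cancelling sum above much better than plain summation, but it does not necessarily fix extreme cancellation like the [1e20, 3, -1e20] example - for that, an exact approach such as Shewchuk's is needed.

class Kahan {
  static double sum(double[] values) {
    double sum = 0., c = 0.; // c accumulates the low-order bits lost so far
    for (double v : values) {
      final double y = v - c;   // apply the correction from the previous step
      final double t = sum + y; // the low-order bits of y are lost in this addition
      c = (t - sum) - y;        // recover the lost part (with opposite sign)
      sum = t;
    }
    return sum;
  }
}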
Note that these issues become even worse once subtraction is involved, such as when computing variance. Never use the well-known E[X^2]-E[X]^2 formula. It is mathematically correct, but when your data is not centered (i.e. E[X] is far from 0 and much larger than the standard deviation), you will see all kinds of odd errors, including negative variances, which may then yield a NaN standard deviation:
> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> numpy.mean(a**2)-numpy.mean(a)**2, numpy.mean(b**2)-numpy.mean(b)**2
(0.33333366666666647, -21532835718365184.0)
(As you can see, numpy.var does not use the naive single-pass formula; it probably uses the classic, straightforward two-pass approach.)
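For reference, here is a minimal sketch of the two-pass approach in Java (illustrative code, not any particular library's implementation): it subtracts the mean from each element, where the values are small, instead of subtracting two huge, nearly equal aggregates.

class Variance {
  static double twoPass(double[] x) {
    // First pass: compute the mean.
    double mean = 0.;
    for (double v : x) {
      mean += v;
    }
    mean /= x.length;
    // Second pass: average the squared deviations from the mean.
    double sqsum = 0.;
    for (double v : x) {
      final double dev = v - mean; // small values, so no catastrophic cancellation
      sqsum += dev * dev;
    }
    return sqsum / x.length; // population variance; divide by (x.length - 1) for the sample estimate
  }
}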
So why do we not always use the accurate computations? Well, we use floating point with fixed precision because it is fast. And most of the time, when dealing with well-conditioned numbers, it is easily accurate enough. To show the performance difference:
> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print timeit.timeit(f, setup="import math; a=range(0,1000)")
30.6121790409
202.994441986
So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude), we may prefer the simpler approach, which is roughly 3-6x faster - at least as far as pure CPU performance is concerned; once I/O comes into play, the difference may disappear altogether. Which is probably why all but the fsum function show the same inaccuracy: performance. In particular, as in 99% of situations the problem won't arise.

Long story, short takeaway: when dealing with large data, pay extra attention to the quality of your results. In fact, do so even when handling small data that is dirty and contains outliers and values of different magnitude. Don't assume that computer math is exact; floating-point arithmetic is not.
Don't just blindly scale up approaches that seemed to work on small data. Analyze them carefully. And last but not least, consider if adding more data will actually give you extra precision.

27 September 2013

Erich Schubert: Big Data madness and reality

"Big Data" has been hyped a lot, and due to this now is burning down to get some reality checking. It's been already overdue, but it is now happening.
I have seen a lot of people get very excited about NoSQL databases; and these databases do have their use cases. However, you just cannot put everything into NoSQL mode and expect it to work. That is why we have recently seen NewSQL databases, query languages for NoSQL databases, etc. - and in fact, they all move back towards relational database management.
Sound new database systems seem to be mostly built on three ideas: optimizing for in-memory instead of on-disk operation (memory has simply become a lot cheaper over the last 10 years), executing queries on the servers that store the actual data (call this "stored procedures" if you like SQL, or "map-reduce" if you like buzzwords), and optimized memory layouts (many of the benefits of "column store" databases come from scanning a single, often primitive, datatype for the whole column, instead of alternating datatypes in records).
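To illustrate the last point, here is a minimal row-versus-column layout sketch in Java (illustrative types only, not any particular database's storage format):

// Row layout: one object per record, with alternating field types in memory.
class RowRecord {
  int id;
  double value;
  String label;
}

// Column layout: one array per attribute. Scanning the "value" column now
// reads a single contiguous double[] of one primitive type, which is where
// much of the column-store scan speed comes from.
class ColumnTable {
  int[] ids;
  double[] values;
  String[] labels;
}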

However, here is one point you need to consider:
is your data actually this "big"? Big as in: Google scale.
I see people use big data and Hadoop a lot when they just shouldn't. I see a lot of people run Hadoop in a VM on their laptop. Ouch.
The big data technologies are not a one-size-fits-all solution. They are the supersize-me solution, and supersize just does not fit every task.
When you look at the cases where Hadoop is really successful, it is mostly in keeping the original raw data and enabling people to re-scan the full data when, for example, their requirements have changed. This is what Hadoop is really good at: managing 100 TB of data, and allowing you to quickly filter out the few GB that you really need for your current task.
For the actual analysis - or when you don't have 100 TB and a large cluster anyway - just don't try to hadoopify everything.
Here is a raw number from a current experiment. I have a student working on indexing image data; he is implementing some state-of-the-art techniques. For these, a large number of image features are extracted, and clustering is then used to find "typical" features to improve image search.
The benchmarking image set is 56 GB (I have others with 2 TB to use next). The subset the student is currently processing is 13 GB. Extracting 2.3 million 128-dimensional feature vectors reduces this to about 1.4 GB. As you can see, the numbers drop quickly.
State of the art seems to be to load the data into Hadoop, and run clustering (actually, this is more of a vector quantization than clustering) into 1000 groups. Mahout is the obvious candidate to run this on Hadoop.
However, as I've put a lot of effort into the data mining software ELKI, I also considered processing it in ELKI.
By cutting the data into 10 MB blocks, Mahout/Hadoop can run the clustering with 52 parallel mappers. k-means is an iterative algorithm, so it needs multiple passes over the data. I have fixed the number of iterations at 10, which should produce a good enough approximation for my use cases.
K-means is embarrassingly parallel, so one would expect the cluster to really shine at this task. Well, here are some numbers:

So what is happening? Why is ELKI beating Mahout by a factor of 10x?
It's (as always) a mixture of a number of things:
Let me emphasize this: I'm not saying Hadoop/Mahout is bad. I'm saying: this data set is not big enough to make Mahout beneficial.

Conclusions: As long as your data fits into your main memory and takes just a few hours to compute, don't go for Hadoop.
It will likely be faster on a single node by avoiding the overhead associated with (current) distributed implementations.
Sometimes, it may also be a solution to use the cluster only for preparing the data, then get it to a powerful workstation, and analyze it there. We did do this with the images: for extracting the image features, we used a distributed implementation (not on Hadoop, but on a dozen PCs).
I'm not saying it will stay that way. I have plans to start "Project HELKI", aka "ELKI on Hadoop", because I do sometimes hit the limit of what I can compute on my 8-core CPU in a "reasonable" amount of time. And of course, Mahout will improve over time, and hopefully lose some of its "Java boxing" weight.
But before trying to run everything on a cluster, always try to run it on a single node first. You can still figure out how to scale up later, once you really know what you need.
And last but not least, consider whether scaling up really makes sense. K-means results don't really get better with a larger data set. They are averages, and adding more data will likely only change the last few digits. Now if you have an implementation that doesn't pay attention to numerical issues, you might end up with more numerical error than you gained from that extra data.
In fact, k-means can effectively be accelerated by processing a sample first, and then only refining the result on the full data set (see the sketch below). And: sampling is the most important "big data" technique. In particular when using statistical methods, consider whether more data can really be expected to improve your quality.
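As an illustration, here is a rough sketch of the sample-then-refine idea in plain Java (illustrative code with made-up names, not ELKI or Mahout code): many cheap Lloyd iterations on a random sample, followed by a couple of refinement passes over the full data.

import java.util.Random;

class SampleThenRefineKMeans {
  // Standard Lloyd iterations: assign each point to its nearest center,
  // then move every center to the mean of its assigned points.
  static void lloyd(double[][] data, double[][] centers, int iterations) {
    final int k = centers.length, dim = data[0].length;
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : data) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < k; c++) {
          double dist = 0.;
          for (int d = 0; d < dim; d++) {
            final double diff = p[d] - centers[c][d];
            dist += diff * diff;
          }
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        counts[best]++;
        for (int d = 0; d < dim; d++) {
          sums[best][d] += p[d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) { // leave empty clusters unchanged in this sketch
          for (int d = 0; d < dim; d++) {
            centers[c][d] = sums[c][d] / counts[c];
          }
        }
      }
    }
  }

  static double[][] cluster(double[][] data, int k, Random rnd) {
    // Draw a random sample (here: roughly 10% of the data, at least k points, with replacement).
    final int sampleSize = Math.max(k, data.length / 10);
    double[][] sample = new double[sampleSize][];
    for (int i = 0; i < sampleSize; i++) {
      sample[i] = data[rnd.nextInt(data.length)];
    }
    // Initialize centers with random sample points (a real implementation
    // would use something smarter, e.g. k-means++).
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {
      centers[c] = sample[rnd.nextInt(sampleSize)].clone();
    }
    lloyd(sample, centers, 10); // cheap: many iterations on the small sample
    lloyd(data, centers, 2);    // expensive: a few refinement passes on all data
    return centers;
  }
}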
Better algorithms: k-means is a very crude heuristic. It optimizes the sum of squared deviations, which may not be very meaningful for your problem. It is not very noise-tolerant, either. And there are thousands of variants. For example bisecting k-means, which is no longer embarrassingly parallel (i.e. it is not as easy to implement on MapReduce), but took just 7 minutes doing 20 iterations for each split. The algorithm can be summarized as: start with k=2, then repeatedly split the largest cluster in two until you have the desired k. For many use cases, this result will be just as good as the full k-means result.
Don't get me wrong. There are true big data problems: web-scale search, for example, or product and friend recommendations at Facebook scale. But chances are that you don't have this amount of data. Google probably doesn't employ k-means at that scale either. (Actually, Google runs k-means on 5 million keypoints for vector quantization; which, judging from my experience here, can still be done on a single host, in particular with hierarchical approaches such as bisecting k-means.)
Don't choose the wrong tool for the wrong problem!

20 August 2013

Erich Schubert: Google turning evil

I used to be a fan of the Google company. I used to think of it as a future career opportunity, as I've received a number of invitations to apply for a job with them.
But I am no longer a fan.
Recently, the company has in my opinion turned to the "evil" side; they have probably become worse than Microsoft ever was, and the "don't be evil" principle is long gone.
So what has changed? In my perception, Google:
  • Is much more focused on making money now than on making technology happen.
  • Instead of making cool technology, it tries more and more to be a "hipster" thing, following the classic old 80-20 rule: get 80% of the users with 20% of the effort. Think of the GMail changes that received so much hate, the various services it shuts down (Code Search, Reader, Google Code downloads), and how much Google is becoming a walled garden.
  • Google leverages its search market dominance as well as Android to push its products that don't perform as well as desired - Plus, for example. There are a lot of things I like about Plus. In particular when you use it as a "blog" type of conversation tool instead of a site for sharing your private data, it's much better than Facebook because of the search function and communities. (And I definitely wouldn't cry if a Facebook successor emerged.) But what I hate is how Google tries hard to leverage its Android and YouTube power to force people onto Plus. And now they are spreading even further, into the TV (Chromecast) and ISP (Google Fiber) markets.
  • Hangups, oops, Hangouts is another such example. I've been using XMPP for a long time. At some point, Google started operating an XMPP server, and it would actually "federate" with other servers, and it worked quite nicely. But then they decided they needed more "hipster" features, like ponies. Or more likely, they decided that they want all these users on Plus. So now they are moving it away from an open standard towards a walled garden. But since 90% of my use is XMPP outside of Google, I will instead just move away from them. Worse.
  • Reader. They probably shut this down to support Currents and, in the long run, to move people over to Plus, too. Fortunately, there now exist a number of good alternatives such as Feedly. But again: Google failed my needs.
  • Maps. Again an example where they moved from "works well" to "hipster crap". The first thing the new Maps always greets me with is a gray screen. The new 3D view looks like an earthquake happened, without offering any actual benefit over the classic (and fast) satellite imagery. Worse. In fact, my favorite map service right now is an OpenStreetMap offline map.
  • Google Glass is pure hipsterdom crap. I have yet to see an actual use case for it. People use it to show off, and that's about it. If you are serious about photos, you use a DSLR camera with a large lens and sensor. But face it: it's just distraction. If you want to be productive, it's actually best to turn off all chat and email and focus. Best productivity tip ever: turn off email notifications. And turn off chat, "always-on" availability and Glass, too.
  • Privacy. When privacy-aware US providers actually recommend against anyone trusting their private data to a company with physical ties to the United States, then maybe it's time to look for services in countries that value freedom more than spying. I cannot trust Google to keep my data private.
We are living in worrisome times. The U.S.A., once proud defenders of freedom, have become the most systematic spies in the world, even on their own people. The surveillance in the Eastern Bloc was no match for the machinery built over the last decade. Say hello to Nineteen Eighty-Four. Surveillance, a mixture of multi-state and commercial surveillance, has indeed become omnipresent. Wireless sensors in trash cans track your devices' MAC addresses. Your email is automatically scanned both by the NSA and by Google. Your social friendships are tracked by Facebook and by your iPhone.
I'm not being paranoid. These things are real, and they can only be explained by a "brainwashing" with both the constantly stoked fear of terror (note that in the U.S.A. you are much more likely to be killed in a traffic accident or by a random gunman than by terrorists) and the constant advertisement of useless technology such as Google Glass. Why have so many people stopped fighting for their freedom?
So what now? I will look into options for moving my stuff away from Google (that is mostly my email - although I'd really like to see a successor to email finally emerge). Maybe I can find a number of good services located e.g. in Switzerland or Norway (which still seem to hold freedom in high regard - neither the UK nor Germany are an alternative these days). And I hope that some politicians will have the courage to openly discuss whether it may be necessary to break up Google into "Boogles" (Baby-Googles), just as Reagan did with AT&T. But unfortunately, today's politicians are really bad at such decisions, in particular when it might lower their short-term popularity. They are only good at making excuses.
